Low-contrast detectability and potential for radiation dose reduction using deep learning image reconstruction—A 20-reader study on a semi-anthropomorphic liver phantom

Background: A novel deep learning image reconstruction (DLIR) algorithm for CT has recently been clinically approved.
Purpose: To assess low-contrast detectability and dose reduction potential for CT images reconstructed with the DLIR algorithm and compare with filtered back projection (FBP) and hybrid iterative reconstruction (IR).
Material and methods: A customized upper-abdomen phantom containing four cylindrical liver inserts with low-contrast lesions was scanned at CT dose indexes of 5, 10, 15, 20 and 25 mGy. Images were reconstructed with FBP, 50% hybrid IR (IR50), and DLIR of low strength (DLL), medium strength (DLM) and high strength (DLH). Detectability was assessed by 20 independent readers using a two-alternative forced choice approach. Dose reduction potential was estimated separately for each strength of DLIR using a fitted model, with the detectability performance of FBP and IR50 as reference.
Results: For the investigated dose levels of 5 and 10 mGy, DLM improved detectability compared to FBP by 5.8 and 6.9 percentage points (p.p.), and DLH improved detectability by 9.6 and 12.3 p.p., respectively (all p < .007). With IR50 as reference, DLH improved detectability by 5.2 and 9.8 p.p. for the 5 and 10 mGy dose level, respectively (p < .03). With respect to this low-contrast detectability task, average dose reduction potential relative to FBP was estimated to 39% for DLM and 55% for DLH. Relative to IR50, average dose reduction potential was estimated to 21% for DLM and 42% for DLH.
Conclusions: Low-contrast detectability performance is improved when applying a DLIR algorithm, with potential for radiation dose reduction.


Introduction
Computed tomography (CT) has become an essential tool in modern clinical medicine [1,2]. With widespread availability, a rapid increase in the use of CT imaging has been observed over the last decades [3]. With the associated increase in radiation exposure, the potential increased risk of radiation-induced malignancy has become a public health concern [4]. In general, the benefit of dose reduction is offset by deterioration of image quality. Thus, technological advances that reduce radiation dose without compromising image quality are sought in clinical practice.
In CT image reconstruction, filtered back projection (FBP) has been the dominant image reconstruction technique since the early 1970s, complemented by the first commercial iterative reconstruction (IR) algorithms in 2009 [5,6]. Although IR has demonstrated potential for dose reduction [7][8][9], concerns have recently been raised that rigorous application of IR may cause a decline in low-contrast detectability due to a shift in image noise texture, particularly when radiation dose is reduced below a certain threshold [10]. Multireader phantom studies have demonstrated that IR algorithms can preserve low-contrast detectability for only modest levels of dose reduction (up to approximately 25%) [11][12][13][14]. In a clinical setting, patient studies have shown that iterative reconstruction does not improve performance at moderate levels of dose reduction and that performance deteriorates at higher levels of radiation dose reduction [15,16].
A novel deep learning image reconstruction (DLIR) algorithm received clinical approval in 2019 (TrueFidelity, GE Healthcare, Milwaukee, WI). Other vendor-specific algorithms for deep learning image reconstruction have also been developed (AiCE, Canon Medical Systems, Otawara, Japan). As explained in a technical white paper [17], having been trained with high-dose and low-dose FBP datasets across phantom and patient cases, the DLIR algorithm strives to suppress image noise without compromising image quality. The use of deep learning image reconstruction has demonstrated potential for improved image quality [18][19][20] and image noise reduction without shifting noise texture [21][22][23]. Although mathematical observer models suggest improved object detection accuracy using deep learning reconstruction [23,24], human observer performance on low-contrast detection across dose levels has, to our knowledge, yet to be explored in a controlled phantom setting.
On this basis, the purpose of this study was to assess the low-contrast diagnostic performance and potential for radiation dose reduction when applying a DLIR algorithm using a customized semi-anthropomorphic upper-abdomen phantom.

Materials and methods
Institutional review board oversight was not required in this phantom-only study.

Phantom design
A customized semi-anthropomorphic upper-abdomen phantom (The Phantom Laboratory, Salem, NY) with a water equivalent diameter of 29.8 cm [25] containing four epoxy inserts with low-contrast lesions of different size and density compared to surrounding background material was applied in this study (Fig. 1).

Image acquisition and reconstruction
The phantom was scanned on a 16-cm multidetector CT scanner (GE Revolution; GE Healthcare; Milwaukee, WI), with scan parameters listed in Table 1, applying a range of tube currents considered clinically relevant [26]. To increase data for statistical analysis, the phantom was scanned three times at each dose level, with inserts interchanged and rotated freely about the z-axis between scans to avoid lesion location recall bias among readers. All series were reconstructed with FBP, 50% hybrid IR (IR50; ASiR-V 50%), and DLIR of low strength (DLL; TrueFidelity Low), medium strength (DLM; TrueFidelity Medium) and high strength (DLH; TrueFidelity High) with 2.5 mm slice thickness. For reconstruction of the DLIR images, raw data were sent to the DLIR vendor (GE Healthcare), as the study was initiated prior to clinical implementation of the DLIR algorithm, which is now commercially available. A tube potential of 120 kVp was chosen based on phantom size and density (footnote to Table 1).

Reader interface and image interpretation
Twenty readers with varying levels of experience (range 1-25 years) voluntarily agreed to participate in reading sessions: ten radiologists, eight medical physicists and two radiographers. Readers were unaware of study design, reconstruction algorithm applied, number of lesions and lesion configuration (i.e., size, density and location). Readings were performed independently without time constraint.
To assess low-contrast detectability, a series of consecutive two-alternative forced choice tests was performed using a tailored script in the MATLAB environment (version 2018a; Mathworks, Natick, MA) (Fig. 2). Readers were presented with two cropped CT images of a specific insert, one image containing hypo- or hyperdense lesions (the signal-present image) and one image containing background noise obtained from a homogeneous part of the insert (the signal-absent image). Readers were then instructed to select the image most likely to contain lesions. Images were viewed on a diagnostic display in a clinical reading room with ambient light conditions and a constant window setting with window width of 150 HU and window level of 75 HU. The signal-present image was randomly selected to be on the left or right side of the screen, and all series were presented in a random order. Two different signal-present images were extracted from each scan. Thus, each reader was presented with a total of 600 pairwise image comparisons: 2 signal-present images × 3 scans × 4 inserts × 5 dose levels × 5 reconstruction algorithms. Finally, to reduce the potential effect of inhomogeneity in background noise mimicking a lesion, the signal-absent image was obtained from two separate z-locations of the homogeneous part of the insert.
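The reading scheme described above can be illustrated with a short Python sketch (the study itself used a tailored MATLAB script that is not reproduced here). It enumerates the 600 pairwise comparisons per reader, assigns the signal-present image to a random side, and shuffles the presentation order. All names (build_trials, is_correct, the dictionary keys) are hypothetical and chosen for illustration only.

```python
import itertools
import random

SIGNAL_IMAGES = (1, 2)                  # two signal-present images per scan
SCANS = (1, 2, 3)                       # three repeated scans per dose level
INSERTS = (1, 2, 3, 4)                  # four liver inserts
DOSES_MGY = (5, 10, 15, 20, 25)         # investigated CT dose index levels
RECONS = ("FBP", "IR50", "DLL", "DLM", "DLH")


def build_trials(seed=None):
    """Return the shuffled list of two-alternative forced choice trials for one reader."""
    rng = random.Random(seed)
    trials = []
    for image, scan, insert, dose, recon in itertools.product(
        SIGNAL_IMAGES, SCANS, INSERTS, DOSES_MGY, RECONS
    ):
        trials.append({
            "image": image,
            "scan": scan,
            "insert": insert,
            "dose_mgy": dose,
            "recon": recon,
            # Side of the screen on which the signal-present image is shown.
            "signal_side": rng.choice(("left", "right")),
        })
    rng.shuffle(trials)  # present all series in random order
    return trials


def is_correct(trial, chosen_side):
    """A response is correct when the reader picks the signal-present side."""
    return chosen_side == trial["signal_side"]


trials = build_trials(seed=0)
print(len(trials))  # 600 = 2 x 3 x 4 x 5 x 5
```

With 20 readers this design yields 20 × 600 = 12,000 responses in total.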

Detectability and dose reduction potential
Detectability was defined as the fraction of correctly selected images containing lesions, and was assessed separately for each dose level and reconstruction algorithm across readers. The dose reduction potential of the DLIR algorithm was defined as the reduction in dose level at which detectability was equal to that of FBP or IR50. To estimate detectability at dose levels not explicitly evaluated in the study setup, observer data were fitted to a mathematical model using the least squares approach in MATLAB. With the assumption of a detectability of 0.5 as the dose level approaches zero (i.e., random guessing) and convergence towards 1 with increasing dose level, the observer scores were modeled by an equation in which D is detectability, erf is the error function, d is the dose level, and α and β are constant fitting parameters [9]. In this mathematical model, detectability equals 0.5 when the dose level, d, approaches zero, as erf(0) = 0. With increasing dose level, the modelled detectability converges towards 1, as erf(∞) = 1. Dose reduction potential was estimated separately for each strength of DLIR and for each investigated dose level. Subgroup analyses were performed comparing radiologist to non-radiologist reader scores.
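As an illustration of how such a fit and the resulting dose reduction estimate could be computed, the following Python sketch assumes the functional form D(d) = 0.5 + 0.5·erf(α·d^β). This form is an assumption: it satisfies the stated properties (D = 0.5 at d = 0 and D → 1 for large d), but the exact equation used in the study is not reproduced in this text. The brute-force grid search stands in for the least squares routine, and the data points are synthetic, not study results.

```python
import math


def detectability(d, alpha, beta):
    """Assumed model: 0.5 at d = 0 (erf(0) = 0), tending to 1 for large d."""
    return 0.5 + 0.5 * math.erf(alpha * d ** beta)


def fit_least_squares(doses, scores, steps=200):
    """Brute-force least squares over a coarse (alpha, beta) grid.

    The grid limits (alpha up to 1.0, beta up to 2.0) are arbitrary choices.
    """
    best_sse, best_alpha, best_beta = float("inf"), None, None
    for i in range(1, steps + 1):
        alpha = i * 0.005
        for j in range(1, steps + 1):
            beta = j * 0.01
            sse = sum(
                (detectability(d, alpha, beta) - s) ** 2
                for d, s in zip(doses, scores)
            )
            if sse < best_sse:
                best_sse, best_alpha, best_beta = sse, alpha, beta
    return best_alpha, best_beta


def matched_dose(d_ref, ref_params, new_params):
    """Dose at which the new algorithm matches the reference detectability at d_ref.

    erf is strictly increasing, so equal detectability means equal erf arguments:
    alpha_new * d**beta_new = alpha_ref * d_ref**beta_ref.
    """
    a_ref, b_ref = ref_params
    a_new, b_new = new_params
    return (a_ref * d_ref ** b_ref / a_new) ** (1.0 / b_new)


def dose_reduction(d_ref, ref_params, new_params):
    """Fractional dose reduction at comparable detectability."""
    return 1.0 - matched_dose(d_ref, ref_params, new_params) / d_ref


# Synthetic example data (NOT study data), generated from known parameters.
doses = [5, 10, 15, 20, 25]
ref_scores = [detectability(d, 0.10, 0.80) for d in doses]
new_scores = [detectability(d, 0.20, 0.80) for d in doses]

ref_fit = fit_least_squares(doses, ref_scores)
new_fit = fit_least_squares(doses, new_scores)
for d in doses:
    print(d, round(dose_reduction(d, ref_fit, new_fit), 3))
```

Under this assumed form the matched dose has a closed-form solution, so no numerical root finding is needed; a different model equation would generally require inverting the fitted curve numerically.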

Statistical analysis
Comparisons of reader detectability scores across reconstruction algorithms were performed applying a pairwise Student's t-test at a significance level of 5% under the null hypothesis that there was no difference in detectability between reconstruction algorithms. Adjusted odds ratios (OR) for correctly selecting the signal-present image (i.e., with low-contrast lesions) were estimated using mixed logistic regression. For this, reconstruction algorithm, dose level and lesion type were included as fixed effects, whereas reader and the combination of scan number and CT image slice were random effects. Model estimates were computed using R statistical software (version 3.0.4, R Foundation for Statistical Computing, Vienna, Austria; https://www.r-project.org/) applying the lme4 package [27].
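The comparison of reader scores between two reconstruction algorithms can be sketched as follows, assuming the test pairs each reader's detectability score under one algorithm with the same reader's score under the other. The function name paired_t and the example scores are hypothetical and are not study data. The sketch returns the mean difference, the t statistic and the degrees of freedom; the p-value would be obtained by comparing the statistic against a Student's t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    # Per-reader differences between the two algorithms.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    # Standard error of the mean difference (sample standard deviation, n - 1).
    # Raises ZeroDivisionError below if all differences are identical.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, mean_diff / std_err, n - 1


# Made-up detectability scores for four hypothetical readers (NOT study data).
dlh = [0.80, 0.90, 0.70, 0.85]
fbp = [0.70, 0.80, 0.65, 0.70]
mean_diff, t_stat, dof = paired_t(dlh, fbp)
print(f"difference = {100 * mean_diff:.1f} p.p., t = {t_stat:.2f}, df = {dof}")
```

Multiplying the mean difference by 100 expresses it in percentage points, the unit used for the detectability differences reported in the Results.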

Results
Example images of inserts across dose levels and reconstruction algorithms containing small and large lesions are presented in Fig. 3A and Fig. 3B, respectively. From the ensemble of 20 readers, a total of 12,000 data points were obtained. One response was discarded due to reader misclassification, and thus 11,999 data points were included in the final analyses.
Detectability at an aggregate level for each reconstruction algorithm and subgroup analysis per insert type is presented in Fig. 4. Performance ranged from approximately 50% (i.e., random guess) for the insert with small lesions imaged at 5 mGy dose level to 100% for the insert with large lesions imaged at 25 mGy dose level. When comparing the DLIR algorithms to FBP, detectability was significantly improved for images reconstructed with DLM and DLH for all dose levels (all p < .003). The largest improvement in detectability was observed for the lower dose levels, where DLM improved detectability by 6.9 p.p. at the 5 mGy dose level (95% CI 2.1-11.7 p.p., p = .007) and 5.8 p.p. at the 10 mGy dose level (95% CI 2.3-9.4 p.p., p = .003). Correspondingly, DLH improved detectability by 9.6 p.p. at the 5 mGy dose level (95% CI 3.5-15.6 p.p., p = .004) and 12.3 p.p. at the 10 mGy dose level (95% CI 8.1-16.5 p.p., p < .001). Compared to IR50, detectability was significantly improved for images reconstructed with DLH for the 5, 10 and 15 mGy dose level (p < .03), with the highest observed difference at the 10 mGy dose level (+9.8 p.p. improvement in detectability, 95% CI 4.8-14.8 p.p., p = .001). An overview of differences in detectability per dose level and reconstruction algorithm is presented in Table 2.
Results from mixed logistic regression are presented in Table 3. Adjusted for lesion type and dose level, odds ratio (OR) for correctly selecting the signal-present CT image was significantly improved for all strengths of DLIR compared to FBP, with OR estimated to 1...
Dose reduction potential with FBP as reference was estimated to an average of 20% for DLL (range 17-22%), 39% for DLM (range 36-42%) and 55% for DLH (range 53-58%) (Table 4). With IR50 as reference, dose reduction potential was estimated to 21% for DLM (range 19-22%) and 42% for DLH (range 41-43%). Fig. 5 illustrates the dose reduction potential of DLH based on the detectability scores for DLH and IR50 across the investigated dose levels.
When comparing radiologists (n = 10) to non-radiologists (n = 10), detectability scores were at an aggregate level slightly higher for radiologists (+2.9 p.p.). However, there was no trend for radiologists performing better for a specific dose level or reconstruction technique (Supplementary Table 1). Dose reduction potential using radiologist scores only and with FBP as reference was estimated to an average of 18%, 34% and 53% for DLL, DLM and DLH, respectively. With IR50 as reference, dose reduction potential was estimated to an average of 21% and 38% for DLM and DLH, respectively.

Discussion
This tailored phantom study demonstrates that CT images reconstructed with a DLIR algorithm can improve low-contrast detectability relative to conventional FBP and hybrid IR, and that this improvement in detectability is dependent on algorithm strength. Based on reader detectability scores, average dose reduction potential for DLIR of high strength was estimated to 55% compared to FBP and 42% compared to 50% hybrid IR.
Fueled by increased availability, a rapid increase in the use of CT imaging has been observed over the last decades and the associated increase in radiation exposure has become a public health concern [3,4]. Although CT provides tremendous benefit to patient care when used for appropriate indications, efforts to keep radiation dose delivered to the patient as low as reasonably achievable have been widely endorsed (i.e., the ALARA principle) [28]. However, decreasing radiation dose is traditionally offset by a deterioration in image quality, driven by an increase in image noise [29]. Detection and characterization of focal liver lesions can be a challenging diagnostic task in abdominal imaging, with decisive implications for patient care in an oncological setting [30]. This low-contrast task is highly sensitive to image noise, especially if suboptimal dose levels have been applied during image acquisition [30,31]. Recent reports have shed light on potential disadvantages of IR algorithms where dose reduction potential may be limited by deterioration in image quality primarily affecting low-contrast tasks, such as obscuring small liver lesions [10]. Thus, novel methods to reduce CT image noise are sought in clinical practice in pursuit of dose reduction. Several studies have investigated the qualitative image quality of DLIR algorithms.
Table caption: Estimated dose level with comparable detectability performance and implied dose reduction potential for hybrid IR and DLIR of various strengths relative to FBP based on a mathematical model fitted to low-contrast detectability observer data. DLH = deep learning image reconstruction of high strength, DLL = deep learning image reconstruction of low strength, DLM = deep learning image reconstruction of medium strength, FBP = filtered back projection, IR50 = 50% hybrid iterative reconstruction.
Phantom studies have demonstrated that a DLIR algorithm can achieve robust noise reduction without deteriorating changes in image noise texture, although demonstrating somewhat degraded low-contrast spatial resolution [21,22]. A recent phantom study by Racine et al. investigating low-contrast detectability using a channelized Hotelling observer as a surrogate marker for human observers estimated the dose reduction potential to 25% for DLM and 33% for DLH compared to 60% hybrid IR [24]. The current study is thus an important supplement, demonstrating improved low-contrast detectability performance by human observers. In a clinical setting, the DLIR algorithm has demonstrated improved perceived overall image quality when compared to standard IR at standard dose abdominal CT [19,20]. A study on images of pediatric patients reconstructed with a DLIR algorithm from a different vendor (AiCE, Canon Medical Systems) found that significant dose reduction was possible without sacrificing image quality [23]. A study on low-dose abdominal CT applying the same vendor-specific algorithm found that deep learning reconstruction improves overall image quality and lesion detection compared to FBP and IR images [32]. It should be noted that the AiCE algorithm is trained with model-based IR images, and results are not necessarily transferable across vendors. Thus, it is important to take into account how the different algorithms are trained when evaluating and comparing vendor-specific DLIR algorithms.
Interestingly, radiologist detectability scores were significantly, albeit only slightly, higher than those of non-radiologists. Subgroup analysis of results based on radiologist scores was important, as radiologists assess the images in clinical practice. However, selecting a signal-present image in a two-alternative forced choice approach does not take into account the diagnostic considerations a radiologist is expected to make when assessing focal liver lesions. Thus, studies incorporating detectability in combination with assessment of diagnostic confidence in a clinical setting with clinicopathological correlation are important supplements to this study in further exploring the dose reduction potential of deep learning based CT image reconstruction algorithms.
This study is not without limitations. First, only one specific scanner and one vendor-specific DLIR algorithm were applied, and results are not necessarily transferable across vendors. Second, this study only compared one level of IR, where several blends of hybrid IR could have been explored. However, there is a practical limit to the number of conditions that can be explored in such a human observer setting. This study was designed to explore different strengths of DLIR and to compare them with the hybrid IR applied as standard at our institution, in addition to FBP, as this vendor-specific DLIR algorithm is trained on FBP datasets. Third, this study lacks validation in a clinical setting, although the use of a semi-anthropomorphic upper-abdomen phantom containing lesions with a variation in size arguably mitigates this limitation.
In conclusion, this study demonstrates that low-contrast detectability is improved when applying a novel DLIR algorithm, and that this improvement is dependent on algorithm strength. The dose reduction potential based on detectability performance for DLIR of high strength was estimated to 55% relative to FBP and 42% relative to hybrid IR. The results in this study can serve as a basis for clinical studies to further investigate the diagnostic performance and dose reduction potential of DLIR.

Funding sources
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: This study is part of ongoing research at the Oslo University Hospital CT Research and Technology Group. Oslo University Hospital has institutional research agreements with GE Healthcare and The Phantom Laboratory, among others. The authors of this article had complete control of data that might have presented a conflict of interest throughout the study period, and the decision to publish has been at the sole discretion of the authors. The authors state no individual conflicts of interest.

Acknowledgements
The authors thank Johann Baptist Dormagen, Mats Engelhardtsen,

Fig. 5. Plot of average low-contrast detectability scores for images reconstructed with deep learning image reconstruction of high strength (DLH) and filtered back projection (FBP) (A) and 50% hybrid iterative reconstruction (IR50) (B) for each investigated CT dose index (CTDIvol). Curves fitted to observer data allow for estimation of the potential reduction in dose level for DLH while maintaining comparable level of detectability to FBP and IR50 for each investigated dose level (dashed lines).